Add a new configuration option to control table header row inference #161

SomeBottle · 2024-12-07T08:03:22Z

When a table does not contain a header defined by <thead> or <th>, markdownify uses the first row of the table as a fallback header by default.

However, sometimes I expect the first row to be parsed as part of the tbody, for example:

<table>
    <tbody>
        <tr>
            <td>John</td>
            <td>123-456-7890</td>
            <td>john@example.com</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>987-654-3210</td>
            <td>bob@example.com</td>
        </tr>
    </tbody>
</table>

I expect it to be converted as follows:

|     |     |     |
| --- | --- | --- |
| John | 123-456-7890 | john@example.com |
| Bob | 987-654-3210 | bob@example.com |

I think I may not be the only one who has encountered this problem, so I added a new option table_header_fallback to control this behavior.

table_header_fallback is set to True by default, in which case the HTML above is converted to:

| John | 123-456-7890 | john@example.com |
| --- | --- | --- |
| Bob | 987-654-3210 | bob@example.com |

If I set table_header_fallback=False, the result will be:

|  |  |  |
| --- | --- | --- |
| John | 123-456-7890 | john@example.com |
| Bob | 987-654-3210 | bob@example.com |
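To make the two outputs above concrete, here is a minimal standalone sketch of the rendering difference. It is not markdownify's code: the function name `rows_to_markdown` and the parameter `infer_header` are hypothetical, and the rows are assumed to be already extracted from the HTML as lists of strings.

```python
def rows_to_markdown(rows, infer_header=False):
    """Render rows (lists of cell strings) as a Markdown table.

    Hypothetical helper for illustration only, not part of markdownify.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    if infer_header:
        # Promote the first data row to the header row.
        header, body = rows[0], rows[1:]
    else:
        # Keep every row as data and emit an empty header row instead.
        header, body = [""] * width, rows

    def fmt(cells):
        # Pad short rows so every line has the same number of columns.
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(padded) + " |"

    lines = [fmt(header), fmt(["---"] * width)]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


rows = [
    ["John", "123-456-7890", "john@example.com"],
    ["Bob", "987-654-3210", "bob@example.com"],
]
print(rows_to_markdown(rows, infer_header=True))
print()
print(rows_to_markdown(rows, infer_header=False))
```

With `infer_header=True` the John row becomes the header. With `infer_header=False` both rows stay in the body under an empty header row.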

I've added this option to the ArgumentParser in main.py (along with --escape-misc, which seems to have been forgotten), and I also created corresponding test cases.

Hope this will help, thank you for your work!

chrispy-snps · 2025-01-01T17:57:09Z

@SomeBottle - thanks for the pull request! I was not aware of this behavior, and I would want your proposed behavior in our own application. Thanks for catching this and proposing a solution.

I have some suggestions:

Could you consider renaming table_header_fallback to table_infer_header?
Could you move the missing --escape-misc argument to a separate pull request?

@AlexVonB, @matthewwithanm - my preference would be to not infer <th> header rows by default so that round-trips between HTML and Markdown are symmetrical. What do you think?

matthewwithanm · 2025-01-02T02:10:58Z

> @AlexVonB, @matthewwithanm - my preference would be to not infer <th> header rows by default so that round-trips between HTML and Markdown are symmetrical. What do you think?

yeah, that behavior is surprising to me

SomeBottle · 2025-01-02T02:23:21Z

Thank you for your response! I've renamed the option and removed --escape-misc from the argument parser in the latest commit.

Should I change the default value of the option to False? Looking forward to your reply.

chrispy-snps · 2025-01-02T11:14:19Z

@SomeBottle - yes, please change the default to not infer header rows from data rows unless specified by the user.

SomeBottle · 2025-01-02T12:19:32Z

OK, I just set the default value of the option to False and updated the corresponding test cases and docs.
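For readers following along, an off-by-default boolean option like this is typically exposed on the command line as a store_true flag. The sketch below is an assumption about how such a flag could look, not a copy of the change to main.py; the flag spelling `--table-infer-header`, the help text, and the `build_parser` helper are illustrative.

```python
import argparse


def build_parser():
    """Build a parser with one opt-in flag (illustrative, not main.py itself)."""
    parser = argparse.ArgumentParser(
        description="Sketch of an opt-in table header inference flag."
    )
    # store_true makes the option False unless the flag is passed,
    # matching the "do not infer unless the user asks" default.
    parser.add_argument(
        "--table-infer-header",
        dest="table_infer_header",
        action="store_true",
        help="Use the first body row as the header when a table has no "
             "<thead> or <th> header (assumed wording).",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.table_infer_header)
```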

chrispy-snps · 2025-01-02T13:06:20Z

@SomeBottle - thanks!

Looking through your code, I see this:

        is_headrow = (
            all([cell.name == 'th' for cell in cells])
            or (el.parent.name == 'thead'
                # avoid multiple tr in thead             # <----
                and len(el.parent.find_all('tr')) == 1)  # <----
        )

What is the thinking behind the check for multiple <tr> elements? Is the thinking that if a <thead> contains multiple <tr> rows, they are probably truly data rows instead of header rows?

SomeBottle · 2025-01-02T13:57:37Z

> @SomeBottle - thanks!
>
> Looking through your code, I see this:
>
>         is_headrow = (
>             all([cell.name == 'th' for cell in cells])
>             or (el.parent.name == 'thead'
>                 # avoid multiple tr in thead             # <----
>                 and len(el.parent.find_all('tr')) == 1)  # <----
>         )
>
> What is the thinking behind the check for multiple <tr> elements? Is the thinking that if a <thead> contains multiple <tr> rows, they are probably truly data rows instead of header rows?

Yes. A <thead> with multiple <tr> rows is unusual, and it cannot be represented as a valid Markdown table header, so I think it should be treated as a missing-header case, with its rows regarded as data rows.
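The rule discussed here can be restated as a small standalone predicate. This is a simplified sketch that takes plain values instead of BeautifulSoup elements; the function name `is_header_row` and its parameters are hypothetical and only mirror the condition quoted above.

```python
def is_header_row(cell_tags, parent_tag, rows_in_parent):
    """Decide whether a <tr> should be rendered as the Markdown header row.

    cell_tags:      tag names of the row's cells, e.g. ["th", "th"]
    parent_tag:     tag name of the row's parent, e.g. "thead" or "tbody"
    rows_in_parent: number of <tr> elements inside that parent
    """
    # A row made up entirely of <th> cells is a header row.
    all_th = all(tag == "th" for tag in cell_tags)
    # A row inside <thead> counts only when it is the sole row there,
    # since a Markdown table can have just one header row.
    sole_thead_row = parent_tag == "thead" and rows_in_parent == 1
    return all_th or sole_thead_row


print(is_header_row(["td", "td"], "thead", 1))  # single <thead> row
print(is_header_row(["td", "td"], "thead", 2))  # multi-row <thead>, treated as data
print(is_header_row(["th", "th"], "tbody", 3))  # all-<th> row
```

Under this rule, the rows of a multi-row <thead> fall through to the missing-header path unless they consist entirely of <th> cells.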

chrispy-snps · 2025-01-19T01:04:06Z

@SomeBottle - could you rebase/merge this branch to the latest state of develop? Thanks!

SomeBottle · 2025-01-19T08:51:41Z

@chrispy-snps OK, I've rebased the branch.

chrispy-snps · 2025-01-19T13:13:39Z

@SomeBottle - thanks for your contribution!

…matthewwithanm#161) Add option to infer first table row as table header (defaults to false)

chrispy-snps changed the title ~~Add new option for table header fallback.~~ Add a new configuration option to control tabler header row inference Jan 19, 2025

SomeBottle and others added 7 commits January 19, 2025 16:33

Add option for table header fallback

a308230

Add options to argument parser

768fada

Fix table header fallback option for thead and add tests

99da38e

Add more test cases

c37fa89

Rename option table_header_fallback to table_infer_header

f44a5c6

Default table_infer_header to False

8feee1a

some minor cosmetic changes

123fdaa

SomeBottle force-pushed the somebottle-develop branch from bfb1c35 to 123fdaa Compare January 19, 2025 08:46

chrispy-snps merged commit 3bf0b52 into matthewwithanm:develop Jan 19, 2025
1 check passed

chrispy-snps changed the title ~~Add a new configuration option to control tabler header row inference~~ Add a new configuration option to control table header row inference Feb 24, 2025

Wuhall pushed a commit to Wuhall/python-markdownify that referenced this pull request May 21, 2025

Add a new configuration option to control tabler header row inference (…

a14c0c4

…matthewwithanm#161) Add option to infer first table row as table header (defaults to false)
